Abstract. Regular expressions (REs), because of their succinctness and
clear syntax, are the common choice to represent regular languages. However,
efficient pattern matching or word recognition depends on the size of
the equivalent nondeterministic finite automata (NFA). We present the
implementation of several algorithms for constructing small ε-free NFAs
from REs within the FAdo system, and a comparison of regular expression
measures and NFA sizes based on experimental results obtained
from uniformly random generated REs. For this analysis, nonredundant
REs and reduced REs in star normal form were considered.

1 Introduction 

Regular expressions (REs), because of their succinctness and clear syntax, are the
common choice to represent regular languages. Equivalent deterministic finite
automata (DFA) would be the preferred choice for pattern matching or word
recognition as these problems can be solved efficiently by DFAs. However, minimal
DFAs can be exponentially bigger than REs. Nondeterministic finite automata
(NFA) obtained from REs can have a number of states linear with respect to
(w.r.t.) the size of the REs. Because NFA minimization is a PSPACE-complete
problem, other methods must be used in order to obtain small NFAs usable for
practical purposes. Conversion methods from REs to equivalent NFAs can produce
NFAs with or without transitions labelled with the empty word (ε-NFA).
Here we consider several constructions of small ε-free NFAs that were recently
developed or improved [Mir66, Ant96, CZ02, HSW01, IY03a, COZ07], and that are
related to the one of Glushkov and McNaughton-Yamada [Glu61, MY60]. The
NFA size can be reduced by merging equivalent states [IY03b, ISOY05]. Another
solution is to simplify the REs before the conversion [EKSW05]. Gruber and
Gulan [GG09] showed that REs in reduced star normal form (snf) achieve some
conversion lower bounds. Our experimental results corroborate that REs must be
converted to reduced snf. In this paper we present the implementation within
the FAdo system [FAd10] of several algorithms for constructing small ε-free
NFAs from REs, and a comparison of regular expression measures and NFA
sizes based on experimental results obtained from uniformly random generated
REs. We consider nonredundant REs and REs in reduced snf in particular.



2 Regular Expressions and Finite Automata 

Let $\Sigma$ be an alphabet (set of letters). A word $w$ over $\Sigma$ is any finite sequence of
letters. The empty word is denoted by $\varepsilon$. Let $\Sigma^*$ be the set of all words over $\Sigma$.
A language over $\Sigma$ is a subset of $\Sigma^*$. The set $R$ of regular expressions (REs) over
$\Sigma$ is defined by:

$$\alpha := \emptyset \mid \varepsilon \mid \sigma \in \Sigma \mid (\alpha + \alpha) \mid (\alpha \cdot \alpha) \mid \alpha^*,$$

where the operator $\cdot$ (concatenation) is often omitted. The language $\mathcal{L}(\alpha)$
associated to $\alpha \in R$ is inductively defined as follows: $\mathcal{L}(\emptyset) = \emptyset$, $\mathcal{L}(\varepsilon) = \{\varepsilon\}$, $\mathcal{L}(\sigma) = \{\sigma\}$
for $\sigma \in \Sigma$, $\mathcal{L}((\alpha + \beta)) = \mathcal{L}(\alpha) \cup \mathcal{L}(\beta)$, $\mathcal{L}((\alpha \cdot \beta)) = \mathcal{L}(\alpha) \cdot \mathcal{L}(\beta)$, and $\mathcal{L}(\alpha^*) = \mathcal{L}(\alpha)^*$.
Two regular expressions $\alpha$ and $\beta$ are equivalent if $\mathcal{L}(\alpha) = \mathcal{L}(\beta)$, and we write
$\alpha \equiv \beta$. The algebraic structure $(R, +, \cdot, \emptyset, \varepsilon)$ constitutes an idempotent semiring,
and with the unary operator $^*$ a Kleene algebra. There are several ways to measure
the size of a regular expression. The size (or ordinary length) $|\alpha|$ of $\alpha \in R$
is the number of symbols in $\alpha$, including parentheses (but not the operator $\cdot$);
the alphabetic size $|\alpha|_\Sigma$ (or alph($\alpha$)) is its number of letters (multiplicities
included); and the reverse polish notation size rpn($\alpha$) is the number of nodes in
its syntactic tree. The alphabetic size is considered in the literature [EKSW05]
the most useful measure, and will be the one we consider here for several RE
measure comparisons. Moreover, all these measures are identical up to a constant
factor if the regular expression is reduced [EKSW05, Th. 3]. Let $\varepsilon(\alpha)$ be $\varepsilon$ if
$\varepsilon \in \mathcal{L}(\alpha)$, and $\emptyset$ otherwise. A regular expression $\alpha$ is reduced if it is normalised
w.r.t. the following equivalences (rules):

$$
\begin{array}{ll}
\varepsilon \cdot \alpha \equiv \alpha \cdot \varepsilon \equiv \alpha & \emptyset \cdot \alpha \equiv \alpha \cdot \emptyset \equiv \emptyset \\
\varepsilon + \alpha \equiv \alpha + \varepsilon \equiv \alpha, \text{ where } \varepsilon(\alpha) = \varepsilon & \emptyset + \alpha \equiv \alpha + \emptyset \equiv \alpha \\
\emptyset^* \equiv \varepsilon &
\end{array}
$$

A RE can be transformed into an equivalent reduced RE in linear time.

A nondeterministic finite automaton (NFA) $A$ is a quintuple $(Q, \Sigma, \delta, q_0, F)$, where
$Q$ is a finite set of states, $\Sigma$ is the alphabet, $\delta \subseteq Q \times \Sigma \times Q$ the transition relation,
$q_0$ the initial state, and $F \subseteq Q$ the set of final states. The size of an NFA is
$|Q| + |\delta|$. For $q \in Q$ and $\sigma \in \Sigma$, we denote $\delta(q, \sigma) = \{p \mid (q, \sigma, p) \in \delta\}$, and we
can extend this notation to $w \in \Sigma^*$, and to $R \subseteq Q$. The language accepted by $A$
is $\mathcal{L}(A) = \{w \in \Sigma^* \mid \delta(q_0, w) \cap F \neq \emptyset\}$. Two NFAs are equivalent if they accept
the same language. If two NFAs $A$ and $B$ are isomorphic, we write $A \simeq B$.
An NFA is deterministic (DFA) if for each pair $(q, \sigma) \in Q \times \Sigma$ there exists at
most one $q'$ such that $(q, \sigma, q') \in \delta$. A DFA is minimal if there is no equivalent
DFA with fewer states. Minimal DFAs are unique up to isomorphism. Given an
equivalence relation $E$ on $Q$, for $q \in Q$ let $[q]_E$ be the class of $q$ w.r.t. $E$, and for
$T \subseteq Q$ let $T/E = \{[q]_E \mid q \in T\}$. The equivalence relation $E$ is right invariant
w.r.t. an NFA $A$ if $E \subseteq (Q \setminus F)^2 \cup F^2$ and for any $p, q \in Q$, $\sigma \in \Sigma$, if $p \, E \, q$, then
$\delta(p, \sigma)/E = \delta(q, \sigma)/E$. The quotient automaton $A/E = (Q/E, \Sigma, \delta_E, [q_0]_E, F/E)$,
where $\delta_E = \{([p]_E, \sigma, [q]_E) \mid (p, \sigma, q) \in \delta\}$, satisfies $\mathcal{L}(A) = \mathcal{L}(A/E)$. Given two
equivalence relations over a set $Q$, $G$ and $H$, we say that $G$ is finer than $H$ (and
$H$ coarser than $G$) if and only if $G \subseteq H$.
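The RE measures and the reduction rules of this section can be illustrated with a minimal sketch. The tuple encoding of REs and the function names below are assumptions made for illustration only; they are not FAdo's classes. The reduction is a simple bottom-up rewrite and is not claimed to run in linear time.

```python
# Regular expressions as nested tuples (an illustrative encoding, not FAdo's):
#   ("empty",), ("eps",), ("sym", "a"), ("plus", l, r), ("cat", l, r), ("star", e)
EMPTY, EPS = ("empty",), ("eps",)


def alph(re):
    """Alphabetic size: number of letter occurrences."""
    if re[0] == "sym":
        return 1
    return sum(alph(sub) for sub in re[1:])


def rpn(re):
    """Reverse polish notation size: number of nodes of the syntax tree."""
    if re[0] == "sym":
        return 1
    return 1 + sum(rpn(sub) for sub in re[1:])


def has_eps(re):
    """True when the empty word belongs to the language of re."""
    tag = re[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "plus":
        return has_eps(re[1]) or has_eps(re[2])
    return has_eps(re[1]) and has_eps(re[2])  # concatenation


def reduce_re(re):
    """Normalise re bottom-up w.r.t. the rules on the empty set and empty word."""
    tag = re[0]
    if tag in ("empty", "eps", "sym"):
        return re
    if tag == "star":
        sub = reduce_re(re[1])
        return EPS if sub == EMPTY else ("star", sub)
    left, right = reduce_re(re[1]), reduce_re(re[2])
    if tag == "cat":
        if EMPTY in (left, right):
            return EMPTY
        if left == EPS:
            return right
        if right == EPS:
            return left
        return ("cat", left, right)
    # Union: drop the empty set, and drop the empty word when it is redundant.
    if left == EMPTY:
        return right
    if right == EMPTY:
        return left
    if left == EPS and has_eps(right):
        return right
    if right == EPS and has_eps(left):
        return left
    return ("plus", left, right)
```

For instance, with `a = ("sym", "a")` and `b = ("sym", "b")`, the expression `("cat", ("plus", EPS, ("star", a)), ("plus", EMPTY, b))` reduces to `("cat", ("star", a), b)`: the alphabetic size stays at 2 while the rpn size drops from 8 to 4.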

3 Small NFAs from Regular Expressions 

We consider three methods for constructing small NFAs $A$ from a regular
expression $\alpha$ such that $\mathcal{L}(A) = \mathcal{L}(\alpha)$, i.e., they are equivalent.

3.1 Position Automata 

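The position automaton construction was independently proposed by Glushkov,
and McNaughton and Yamada [Glu61, MY60]. Let $\mathrm{Pos}(\alpha) = \{1, 2, \ldots, |\alpha|_\Sigma\}$ for
$\alpha \in R$, and let $\mathrm{Pos}_0(\alpha) = \mathrm{Pos}(\alpha) \cup \{0\}$. We consider the expression $\overline{\alpha}$
obtained by marking each letter $\sigma$ with its position $i$ in $\alpha$, $\sigma_i$. The same notation
is used to remove the markings, i.e., $\overline{\overline{\alpha}} = \alpha$. For $\alpha \in R$ and $i \in \mathrm{Pos}(\alpha)$, let
$\mathrm{first}(\alpha) = \{j \mid \exists w \in \overline{\Sigma}^*, \sigma_j w \in \mathcal{L}(\overline{\alpha})\}$, $\mathrm{last}(\alpha) = \{j \mid \exists w \in \overline{\Sigma}^*, w \sigma_j \in \mathcal{L}(\overline{\alpha})\}$
and $\mathrm{follow}(\alpha, i) = \{j \mid \exists u, v \in \overline{\Sigma}^*, u \sigma_i \sigma_j v \in \mathcal{L}(\overline{\alpha})\}$. Let $\mathrm{follow}(\alpha, 0) = \mathrm{first}(\alpha)$.

The position automaton for $\alpha \in R$ is $A_{pos}(\alpha) = (\mathrm{Pos}_0(\alpha), \Sigma, \delta_{pos}, 0, F)$, with
$\delta_{pos} = \{(i, \overline{\sigma_j}, j) \mid j \in \mathrm{follow}(\alpha, i)\}$ and $F = \mathrm{last}(\alpha) \cup \{0\}$ if $\varepsilon(\alpha) = \varepsilon$, and
$F = \mathrm{last}(\alpha)$, otherwise. We note that the number of states of $A_{pos}(\alpha)$ is
exactly $|\alpha|_\Sigma + 1$. Another interesting property is that $A_{pos}$ is homogeneous, i.e., all
transitions arriving at a given state are labelled by the same letter. Brüggemann-Klein
[BK93] showed that the construction of $A_{pos}$ can be obtained in $O(n^2)$
($n = |\alpha|$) if the regular expression $\alpha$ is in the so called star normal form (snf),
i.e., if for each subexpression $\beta^*$ of $\alpha$, $\forall x \in \mathrm{last}(\beta)$, $\mathrm{follow}(\beta, x) \cap \mathrm{first}(\beta) = \emptyset$ and
$\varepsilon(\beta) = \emptyset$. For every $\alpha \in R$ there is an equivalent RE in star normal form $\alpha^\bullet$ that
can be computed in linear time and such that $A_{pos}(\alpha) \simeq A_{pos}(\alpha^\bullet)$.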
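The sets first, last and follow can be computed by one traversal of the syntax tree, from which the position automaton follows directly. The sketch below is an illustration under assumptions: the tuple encoding of REs and the name `position_automaton` are hypothetical (this is not FAdo's `nfaPosition()`), and the empty set is assumed not to occur as a proper subexpression, as is the case for reduced REs.

```python
def position_automaton(re):
    """Build the position automaton of re, given as nested tuples:
    ("eps",), ("sym", "a"), ("plus", l, r), ("cat", l, r), ("star", e).
    Returns (states, delta, initial, finals); delta is a set of triples."""
    letters = {}  # position -> letter carried by that position
    follow = {}   # position -> set of positions that may come next

    def walk(node):
        """Return (first, last, nullable) of node, filling letters and follow."""
        tag = node[0]
        if tag == "empty":
            return set(), set(), False
        if tag == "eps":
            return set(), set(), True
        if tag == "sym":
            i = len(letters) + 1  # positions are numbered from left to right
            letters[i] = node[1]
            follow[i] = set()
            return {i}, {i}, False
        if tag == "star":
            first, last, _ = walk(node[1])
            for i in last:
                follow[i] |= first  # a last position loops back to a first one
            return first, last, True
        f1, l1, n1 = walk(node[1])
        f2, l2, n2 = walk(node[2])
        if tag == "plus":
            return f1 | f2, l1 | l2, n1 or n2
        for i in l1:  # concatenation: last of the left feeds first of the right
            follow[i] |= f2
        return (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2), n1 and n2

    first, last, nullable = walk(re)
    follow[0] = set(first)  # follow(alpha, 0) = first(alpha)
    delta = {(i, letters[j], j) for i in follow for j in follow[i]}
    finals = set(last) | ({0} if nullable else set())
    return set(follow), delta, 0, finals
```

For $(a + b)^* a$, encoded as `("cat", ("star", ("plus", ("sym", "a"), ("sym", "b"))), ("sym", "a"))`, the sketch yields the four states 0 to 3 (alphabetic size plus one), nine transitions, and the single final state 3. Every transition entering state $j$ carries the letter of position $j$, which is the homogeneity property.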

3.2 Follow Automata 

Ilie and Yu [IY03a] introduced the construction of the follow automaton from
a RE. Their initial algorithm begins by converting $\alpha \in R$ into an equivalent
ε-NFA from which the follow automaton $A_f(\alpha)$ is obtained. For efficiency reasons
we implemented that method in the FAdo library. The follow automaton is a
quotient of the position automaton w.r.t. the right-invariant equivalence given
by the follow relation $\equiv_f \subseteq \mathrm{Pos}_0^2$ that is defined by:

$\forall x, y \in \mathrm{Pos}_0(\alpha)$, $x \equiv_f y$ if (i) both $x$, $y$ or none belong to $\mathrm{last}(\alpha)$ and
(ii) $\mathrm{follow}(\alpha, x) = \mathrm{follow}(\alpha, y)$.

Proposition 1 (Ilie and Yu, Thm. 23). $A_f(\alpha) \simeq A_{pos}(\alpha)/\equiv_f$.
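Proposition 1 suggests a direct, if naive, way to obtain the follow automaton from the position automaton: group positions that agree on finality and on their follow sets. The sketch below is only an illustration of that quotient. The name `follow_quotient` and the dictionary inputs are assumptions, and finality is tested against the final states of the position automaton so that position 0 is covered as well. It is not the ε-NFA based method implemented in FAdo.

```python
def follow_quotient(follow, letters, finals):
    """Quotient of a position automaton by the follow relation.

    follow:  position -> set of positions (position 0 included)
    letters: position -> letter, for positions 1..n
    finals:  set of final positions of the position automaton
    Returns (states, delta, initial, finals) with frozensets of positions as states.
    """
    # Two positions are merged when they agree on finality and on follow sets.
    classes = {}
    for x in follow:
        key = (x in finals, frozenset(follow[x]))
        classes.setdefault(key, set()).add(x)
    block = {x: frozenset(c) for c in classes.values() for x in c}

    states = set(block.values())
    delta = {(block[i], letters[j], block[j]) for i in follow for j in follow[i]}
    new_finals = {block[x] for x in finals}
    return states, delta, block[0], new_finals
```

For $(a + b)^* a$ the positions 0, 1 and 2 share the follow set $\{1, 2, 3\}$ and are all non-final, so they collapse into one state; the resulting automaton has 2 states and 3 transitions instead of 4 states and 9 transitions.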



3.3 Partial Derivative Automata 



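Let $S \cup \{\beta\}$ be a set of regular expressions. Then $S \odot \beta = \{\alpha\beta \mid \alpha \in S\}$ if $\beta \neq \emptyset$,
and $S \odot \emptyset = \emptyset$. For $\alpha \in R$ and $\sigma \in \Sigma$, the set $\partial_\sigma(\alpha)$ of partial derivatives of $\alpha$
w.r.t. $\sigma$ is defined inductively as follows:

$$
\begin{aligned}
\partial_\sigma(\emptyset) &= \partial_\sigma(\varepsilon) = \emptyset \\
\partial_\sigma(\sigma') &= \begin{cases} \{\varepsilon\} & \text{if } \sigma' = \sigma \\ \emptyset & \text{otherwise} \end{cases} \\
\partial_\sigma(\alpha + \beta) &= \partial_\sigma(\alpha) \cup \partial_\sigma(\beta) \\
\partial_\sigma(\alpha\beta) &= \begin{cases} \partial_\sigma(\alpha) \odot \beta \cup \partial_\sigma(\beta) & \text{if } \varepsilon(\alpha) = \varepsilon \\ \partial_\sigma(\alpha) \odot \beta & \text{otherwise} \end{cases} \\
\partial_\sigma(\alpha^*) &= \partial_\sigma(\alpha) \odot \alpha^*
\end{aligned}
$$

This definition can be extended to sets of regular expressions, words, and
languages. Given $\alpha \in R$ and $\sigma \in \Sigma$, $\partial_\sigma(S) = \bigcup_{\alpha \in S} \partial_\sigma(\alpha)$ for $S \subseteq R$, $\partial_\varepsilon(\alpha) = \{\alpha\}$,
$\partial_{w\sigma}(\alpha) = \partial_\sigma(\partial_w(\alpha))$ for $w \in \Sigma^*$, and $\partial_L(\alpha) = \bigcup_{w \in L} \partial_w(\alpha)$ for $L \subseteq \Sigma^*$. The set
of partial derivatives of $\alpha$ is denoted by $\mathrm{PD}(\alpha) = \{\partial_w(\alpha) \mid w \in \Sigma^*\}$.

Given a regular expression $\alpha$, the partial derivative automaton $A_{pd}(\alpha)$,
introduced by Mirkin and Antimirov [Mir66, Ant96], is defined by

$$A_{pd}(\alpha) = (\mathrm{PD}(\alpha), \Sigma, \delta_{pd}, \alpha, \{q \in \mathrm{PD}(\alpha) \mid \varepsilon(q) = \varepsilon\}),$$

where $\delta_{pd}(q, \sigma) = \partial_\sigma(q)$, for all $q \in \mathrm{PD}(\alpha)$ and $\sigma \in \Sigma$.

Proposition 2 (Mirkin and Antimirov). $\mathcal{L}(A_{pd}(\alpha)) = \mathcal{L}(\alpha)$.

Champarnaud and Ziadi [CZ02] showed that the partial derivative automaton is
also a quotient of the position automaton. Champarnaud et al. [COZ07] proved
that for a RE that is reduced and in star normal form the size of its partial derivative
automaton $A_{pd}$ is always smaller than the one of its follow automaton $A_f$.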



3.4 Complexity 

The automata presented here, $A_{pos}$, $A_f$ and $A_{pd}$, can in the worst case be constructed
in time and space $O(n^2)$, and have, in the worst case, size $O(n^2)$, where $n$ is the size
of the RE. Recently, Nicaud [Nic09] showed that in the average case the size of
the $A_{pos}$ automata is linear. The best worst-case construction of ε-free NFAs from
REs is the one presented by Hromkovic et al. [HSW01], which can be constructed in
time $O(n (\log n)^2)$ and has size $O(n (\log n)^2)$. However, this construction is not considered here.



4 NFAs Reduction with Equivalences 

It is possible to obtain in time $O(n \log n)$ a (unique) minimal DFA equivalent
to a given one. However, NFA state minimization is PSPACE-complete and, in
general, minimal NFAs are not unique. Considering the exponential succinctness
of NFAs w.r.t. DFAs, it is important to have methods to obtain small NFAs. Any
right-invariant equivalence relation over $Q$ w.r.t. $A$ can be used to diminish the
size of $A$ (by computing the quotient automaton). The coarsest right-invariant
equivalence $\equiv_R$ can be computed by an algorithm similar to the one used to
minimize DFAs [IY03b]. This coincides with the notion of (auto)-bisimulation,
widely applied to transition systems and which can be computed efficiently (in
almost linear time) by the Paige and Tarjan algorithm [PT87]. A left-invariant
equivalence relation on $Q$ w.r.t. $A$ is any right-invariant equivalence relation on
the reversed automaton of $A$, $A^r = (Q, \Sigma, \delta^r, F, \{q_0\})$, where $q \in \delta^r(p, \sigma)$ if
$p \in \delta(q, \sigma)$ (and we allow multiple initial states). The coarsest left-invariant
equivalence on $Q$ w.r.t. $A$, $\equiv_L$, is $\equiv_R$ of $A^r$.
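A minimal sketch of this idea, in the style of Moore's DFA minimization, refines the partition into final and non-final states until every block is stable; the left-invariant case reuses the same routine on the reversed transitions. The function names and the representation of the transition relation as a set of triples are assumptions made for illustration. This is neither the Paige and Tarjan algorithm nor FAdo's code.

```python
def coarsest_right_invariant(states, sigma, delta, finals):
    """Return a dict state -> block id describing the coarsest right-invariant
    equivalence of the NFA (states, sigma, delta, _, finals)."""
    succ = {}
    for p, a, q in delta:
        succ.setdefault((p, a), set()).add(q)

    block = {q: int(q in finals) for q in states}  # start from {F, Q \ F}
    while True:
        # Signature of a state: its block and, per letter, the blocks it reaches.
        sig = {
            q: (block[q],
                tuple(frozenset(block[t] for t in succ.get((q, a), ()))
                      for a in sorted(sigma)))
            for q in states
        }
        ids = {}
        new_block = {q: ids.setdefault(sig[q], len(ids)) for q in states}
        if len(ids) == len(set(block.values())):
            return new_block  # no block was split: the partition is stable
        block = new_block


def coarsest_left_invariant(states, sigma, delta, initial):
    """Left invariance is right invariance on the reversed automaton, whose
    set of final states is the original initial state."""
    reversed_delta = {(q, a, p) for (p, a, q) in delta}
    return coarsest_right_invariant(states, sigma, reversed_delta, {initial})
```

For the NFA with transitions $0 \xrightarrow{a} 1$, $0 \xrightarrow{a} 2$, $1 \xrightarrow{b} 3$, $2 \xrightarrow{b} 3$, initial state 0 and final state 3, states 1 and 2 fall into the same block for both relations, so either quotient has three states.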



5 FAdo Implementations 
+linearForm(): dict
+snf(): regexp
+nfaPD(): NFA
+nfaPosition(): NFA
+nfaFollowEpsilon(): NFAr
+nfaFollow(): NFA

Fig. 1. FAdo classes for REs



FAdo [FAd10, MR05, AAA+09] is an ongoing project that aims to provide a
set of tools for symbolic manipulation of formal languages. Its main features are
high-level programming with complex data structures, easy prototyping of
algorithms, and portability. It is mainly developed in the Python programming
language. In FAdo, regular expressions and finite automata are implemented
as Python classes.

Figure 1 presents the classes for REs and the main methods described in this
paper. The regexp class is the base class for all REs and the class position is
the base class for marked REs. The methods first(), last() and followMap()
(where $\mathrm{follow}(\alpha, x) = \{\beta \mid (x, \beta) \in$ followMap()$\}$) are coded for each subclass.

The method nfaPosition() implements a construction of the $A_{pos}$ automaton
without reduction to snf. Brüggemann-Klein's algorithm is implemented by the
nfaPSNF() method. The methods nfaFollowEpsilon() and nfaFollow()
implement the construction of the $A_f$ via an ε-NFA. The exact text of all these
algorithms is too long to present here. The method nfaPD() computes the $A_{pd}$
and uses the method linearForm(). This method implements the function lf()
defined by Antimirov [Ant96] to compute the partial derivatives of a RE w.r.t.
all letters. Algorithm 1 presents the computation of the $A_{pd}$.

Algorithm 1 Computation of $A_{pd}$
  Q ← {α}
  F ← ∅
  stack ← {α}
  while pd ← POP(stack) do
    for (head, tail) ∈ lf(pd) do
      if ¬ tail ∈ Q then
        Q ← Q ∪ {tail}
        PUSH(stack, tail)
      end if
      δ(pd, head) ← δ(pd, head) ∪ {tail}
    end for
    if ε(pd) = ε then
      F ← F ∪ {pd}
    end if
  end while
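The worklist scheme of Algorithm 1 can be sketched in a few lines of Python. The sketch is an illustration under assumptions: REs are nested tuples rather than FAdo objects, partial derivatives are computed letter by letter instead of through the linear form lf(), and products of the form $\varepsilon\beta$ are simplified to $\beta$ so that equal derivatives are recognised as equal tuples.

```python
EMPTY, EPS = ("empty",), ("eps",)


def has_eps(re):
    """True when the empty word belongs to the language of re."""
    tag = re[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "plus":
        return has_eps(re[1]) or has_eps(re[2])
    return has_eps(re[1]) and has_eps(re[2])


def concat(exprs, beta):
    """The product S (.) beta, with eps.beta simplified to beta."""
    if beta == EMPTY:
        return set()
    if beta == EPS:
        return set(exprs)
    return {beta if e == EPS else ("cat", e, beta) for e in exprs}


def pd(re, sigma):
    """Set of partial derivatives of re w.r.t. the letter sigma."""
    tag = re[0]
    if tag in ("empty", "eps"):
        return set()
    if tag == "sym":
        return {EPS} if re[1] == sigma else set()
    if tag == "plus":
        return pd(re[1], sigma) | pd(re[2], sigma)
    if tag == "star":
        return concat(pd(re[1], sigma), re)
    out = concat(pd(re[1], sigma), re[2])  # concatenation
    if has_eps(re[1]):
        out |= pd(re[2], sigma)
    return out


def nfa_pd(re, alphabet):
    """Partial derivative automaton of re: (states, delta, initial, finals)."""
    states, finals, delta = {re}, set(), {}
    stack = [re]
    while stack:
        q = stack.pop()
        for sigma in alphabet:
            for tail in pd(q, sigma):
                if tail not in states:  # a new partial derivative was found
                    states.add(tail)
                    stack.append(tail)
                delta.setdefault((q, sigma), set()).add(tail)
        if has_eps(q):
            finals.add(q)
    return states, delta, re, finals
```

For $\alpha = (a + b)^* a$ the only partial derivatives are $\alpha$ itself and $\varepsilon$, so `nfa_pd` returns two states, with $\delta(\alpha, a) = \{\alpha, \varepsilon\}$ and $\delta(\alpha, b) = \{\alpha\}$.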



Figure 2 presents the classes for finite automata. FA is the abstract class
for finite automata. The class NFAr includes the inverse of the transition
relation, which is not included in the NFA class for efficiency reasons. In the NFA
class the method autobisimulation() implements a naive version of the computation of
$\equiv_R$, as presented in Algorithm 2. Given an equivalence relation, the method
equivReduced() builds the quotient automaton. Given an NFA $A$, A.rEquiv()
corresponds to $A/\equiv_R$, A.lEquiv() to $A/\equiv_L$ and A.lrEquiv() to $(A/\equiv_L)/\equiv_R$.
We refer the reader to Gouveia [Gou09] and to the FAdo webpage [FAd10] for more
implementation details.



Algorithm 2 Computation of the set R corresponding to $\equiv_R$.
  for (p, q) ∈ Q × Q do
    if p ∈ F xor q ∈ F then
      R ← R ∪ {(p, q)}
    end if
  end for
  while ∃(x, y) ∉ R : ∃σ ∈ Σ : ∃z ∈ δ(x, σ) : ∀w ∈ δ(y, σ) : z R w do
    R ← R ∪ {(x, y), (y, x)}
  end while
  R ← (Q × Q) \ R
  Return R
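Algorithm 2 first collects the pairs of states that can be told apart and returns the complement. A near-literal Python transcription is sketched below; the function name and the representation of $\delta$ as a set of triples are assumptions, and no attempt is made at efficiency, in keeping with the naive version described in the text.

```python
def right_invariant_pairs(states, sigma, delta, finals):
    """Return the set of pairs (p, q) related by the coarsest right-invariant
    equivalence, following the pair-marking scheme of Algorithm 2."""

    def succ(q, a):
        return {t for (p, b, t) in delta if p == q and b == a}

    # R collects the pairs of states known to be distinguishable.
    R = {(p, q) for p in states for q in states
         if (p in finals) != (q in finals)}

    changed = True
    while changed:
        changed = False
        for x in states:
            for y in states:
                if (x, y) in R:
                    continue
                # x has a successor z that no successor w of y can match.
                if any(all((z, w) in R for w in succ(y, a))
                       for a in sigma for z in succ(x, a)):
                    R |= {(x, y), (y, x)}
                    changed = True

    return {(p, q) for p in states for q in states} - R
```

On the NFA with transitions $0 \xrightarrow{a} 1$, $0 \xrightarrow{a} 2$, $1 \xrightarrow{b} 3$, $2 \xrightarrow{b} 3$ and final state 3, the result consists of the four reflexive pairs together with $(1, 2)$ and $(2, 1)$.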



FA
+States: list of state names
+Sigma: set of symbols
+Initial: state index
+Final: set of state indexes
+delta: (state index, symbol) keys to set of state indexes
+trim()
+trimStates()

NFA (extends FA)
+deleteStates(del_states: list of state indexes)
+closeEpsilon(state)
+epsilonPaths(start: state index, end: state index): set of states
+autobisimulation(): set of pairs of state indexes
+autobisimulation2(): list of pairs of state indexes
+equivReduced(): NFA
+rEquivNFA(): NFA
+lEquivNFA(): NFA
+lrEquivNFA(): NFA

NFAr (extends NFA)
+mergeIncomingEpsilon(state: state index): state index
+mergeOutgoingEpsilon(state: state index): state index
+mergeStates(f: state index, t: state index)
+mergeStatesSet(tomerge: set, target: state index)

Fig. 2. FAdo classes for NFAs

6 RE Random Generator 

Uniform random generators are essential to obtain reliable experimental results
that can provide information about the average-case analysis of both
computational and descriptional complexity. For general regular expressions, the task
is somewhat simplified because they can be described by small unambiguous
context-free grammars from which it is possible to build uniform random
generators [Mai94]. In the FAdo system we implemented the method described by
Mairson [Mai94] for the generation of context-free languages. The method
accepts as input a context-free grammar and the size of the words to be uniformly
random generated.

The random samples need to be consistent and large enough to ensure
statistically significant results. To have these samples readily available, the FAdo
system includes a dataset of random REs that can be accessed online. The
current dataset was obtained using a grammar for REs given by Lee and Shallit
[LS05], which is presented in Figure 3. This grammar generates REs normalized
by the rules that define reduced REs, except for certain cases of the rule $\varepsilon + \alpha$,
where $\varepsilon(\alpha) = \varepsilon$. The database makes available random samples of REs with
different sizes between 10 and 500 and with alphabet sizes between 2 and 50.
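The principle behind uniform generation from an unambiguous grammar can be sketched with the classical counting approach: first count the words of each length derivable from every production, then choose productions with probability proportional to those counts. The sketch below is an illustration under assumptions. It uses a hypothetical toy grammar for REs in prefix notation, E → letter | * E | + E E | . E E, rather than the Lee and Shallit grammar of Figure 3, and it is a plain recursive sampler, not Mairson's algorithm as implemented in FAdo.

```python
import random


def count_table(n, k):
    """c[m] = number of words of length m generated by the toy grammar
    E -> letter | * E | + E E | . E E  over an alphabet of k letters."""
    c = [0] * (n + 1)
    if n >= 1:
        c[1] = k
    for m in range(2, n + 1):
        # One symbol is the operator; a binary operator splits the remaining
        # m - 1 symbols between its two arguments.
        c[m] = c[m - 1] + 2 * sum(c[i] * c[m - 1 - i] for i in range(1, m - 1))
    return c


def generate(n, letters, rng=random):
    """Return a uniformly random prefix-notation RE of length n (n >= 1)."""
    c = count_table(n, len(letters))
    if c[n] == 0:
        raise ValueError("no word of this length")

    def build(m):
        if m == 1:
            return rng.choice(letters)
        r = rng.randrange(c[m])  # index of the word among the c[m] candidates
        if r < c[m - 1]:
            return "*" + build(m - 1)
        r -= c[m - 1]
        for op in "+.":
            for i in range(1, m - 1):
                weight = c[i] * c[m - 1 - i]
                if r < weight:
                    return op + build(i) + build(m - 1 - i)
                r -= weight
        raise AssertionError("unreachable: counts are inconsistent")

    return build(n)
```

Because prefix notation is unambiguous, each word of length $n$ corresponds to exactly one sequence of choices, so every word is produced with probability $1/c[n]$. For very large $n$ the recursion depth of `build` may exceed Python's default limit; an iterative variant would avoid that.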

7 Experimental Results 

Fig. 3. Grammar for almost reduced REs. The start symbol is S.

In order to experiment with several properties of REs and NFAs we developed a
generic program that makes it easy to add or remove the methods to be applied and to
specify the data, from the database, to be used. Here we are interested in the comparison
of several RE descriptional measures with measures of the NFAs obtained using
the methods described earlier.

For REs we considered the following properties: the alphabetic size (alph);
the rpn size (rpn); test if it is in snf (snf); if not in snf, compute the snf and its
measures (alph, rpn); test if it is reduced; if not reduced, reduce it and compute
its measures (alph, rpn); the number of states (sc) and number of transitions (tc)
of the equivalent minimal DFA.

For each NFA ($A_{pos}$, $A_f$, and $A_{pd}$) we considered the following properties:
the number of states ($|Q|$); the number of transitions ($|\delta|$); if it is deterministic
(det); and if it is homogeneous (hom). All these properties were also considered
for the case where the REs are in snf, and for the NFAs obtained after applying
the invariant equivalences $\equiv_R$, $\equiv_L$, and their composition.

All tests were performed on samples of 10,000 uniformly random generated
REs. There is one sample for each RE size: 50, 100, 200 and 300.

Table 1 shows some results concerning REs. The ratio of alphabetic size to
rpn size is almost constant for all samples. Almost all REs are in snf, so we
do not present the measures after transforming into snf. This fact is relevant
as the REs were generated only almost reduced. The column snfr contains the
percentage of REs whose snf is reduced. It is interesting to note that
the average number of states of the minimal DFA (sc) is near alph (i.e., near
the number of states of $A_{pos}$). The standard deviation is here very high. For
the sample of size 300, however, 99% of the REs have 160 < sc < 300. More
theoretical work is needed for a deeper understanding of these results.

Table 1. Statistical values for RE measures, where (avg) is the average and 
(std) the standard deviation. 



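| size | alph avg | alph std | rpn avg | rpn std | rpn/alph | snf | snfr | sc avg | sc std | tc avg | tc std | sc/alph | tc/alph |
|------|----------|----------|---------|---------|----------|-----|------|--------|--------|--------|--------|---------|---------|
| 50   | 42       | 6.39     | 85      | 10.80   | 2.04     | 97% | 99%  | 38     | 9.42   | 44     | 6.39   | 0.92    | 1.05    |
| 100  | 77       | 10.26    | 161     | 17.41   | 2.08     | 93% | 98%  | 69     | 20.00  | 89     | 37.47  | 0.89    | 1.15    |
| 200  | 165      | 25.75    | 340     | 43.83   | 2.06     | 90% | 97%  | 160    | 91.58  | 203    | 186.10 | 0.97    | 1.24    |
| 300  | 247      | 38.06    | 511     | 64.96   | 2.06     | 87% | 95%  | 258    | 300.01 | 343    | 617.51 | 1.04    | 1.4     |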

Table 2 and Table 3 show some results concerning the NFAs obtained from
REs. In Table 2 the values not in percentage are average values. If $A_{pos}$ is
deterministic then the RE is unambiguous (and strongly unambiguous, if in
snf) [BK93]. The results obtained suggest that perhaps 25% of the reduced REs
are strongly unambiguous. Note that if $A_{pos}$ is not deterministic, almost certainly,
neither $A_{pd}$ nor $A_f$ are. For reasonably sized REs, although the $A_{pos}$ are
homogeneous, it is unlikely that either $A_{pd}$ or $A_f$ will be so. The difference
between $|Q_f|$ and $|Q_{pd}|$ is not significant. On average $|\delta_{pos}|$ seems linear in the size of
the RE, and that fact was recently proved by Nicaud [Nic09].

Table 2. NFA measures. 



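| size | $\lvert Q_{pos}\rvert$ | $\lvert\delta_{pos}\rvert$ | det | hom | $\lvert Q_f\rvert$ | $\lvert\delta_f\rvert$ | det | hom | $\lvert Q_{pd}\rvert$ | $\lvert\delta_{pd}\rvert$ | det | hom |
|------|------|------|-------|------|------|------|-------|-------|------|------|-------|-------|
| 50   | 43   | 51   | 49.1% | 100% | 38   | 44   | 49.3% | 13.7% | 38   | 44   | 49.4% | 13.6% |
| 100  | 78   | 104  | 16.0% | 100% | 67   | 84   | 17.0% | 1.0%  | 66   | 83   | 17.0% | 1.0%  |
| 200  | 166  | 211  | 27.6% | 100% | 148  | 175  | 27.7% | 1.5%  | 146  | 173  | 27.7% | 1.4%  |
| 300  | 248  | 317  | 23.9% | 100% | 222  | 262  | 23.9% | 0.5%  | 220  | 260  | 23.9% | 0.5%  |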


Table 3. Ratios of NFA measures. 
| size |      |      |      |      |      |      |      |      |
|------|------|------|------|------|------|------|------|------|
| 50   | 1.18 | 0.90 | 1.02 | 0.89 | 1.02 | 0.86 | 0.99 | 0.99 |
| 100  | 1.33 | 0.85 | 1.07 | 0.84 | 1.05 | 0.79 | 0.98 | 0.99 |
| 200  | 1.27 | 0.89 | 1.06 | 0.88 | 1.05 | 0.82 | 0.99 | 0.99 |
| 300  | 1.28 | 0.89 | 1.06 | 0.88 | 1.05 | 0.82 | 0.99 | 0.99 |



Reductions by $\equiv_R$ and $\equiv_L$ (or $\equiv_R \circ \equiv_L$) decrease by less than 2% the size of
the considered NFAs ($A_{pos}$, $A_f$, and $A_{pd}$). In particular the quotient automata
of $A_{pos}$ are less than 1% smaller than $A_{pd}$. In general, we can hypothesize that
reductions by the coarsest invariant equivalences are not significant when REs
are reduced (and/or are in snf).

8 Conclusion 

We presented a set of tools within the FAdo system to uniformly random generate
REs, to convert REs into ε-free NFAs and to simplify both REs and NFAs.
These tools can be used to obtain experimental results about the relative
descriptional complexity of regular language representations in the average case.
Our experimental data corroborate some previous experimental and theoretical
results, and suggest some new hypotheses to be theoretically proved. We highlight
the two following conjectures. Reduced REs have a high probability of being
in snf. And the $A_{pd}$ obtained from REs in reduced snf seems to almost coincide
with the quotient automata of $A_{pos}$ by $\equiv_R \circ \equiv_L$.

We would like to thank the anonymous referees for their comments that 
helped to improve this paper. 